NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

https://doi.org/10.1145/3577193.3593717

Zhang, Chengming; Smith, Shaden; Sun, Baixi; Tian, Jiannan; Soifer, Jonathan; Yu, Xiaodong; Song, Shuaiwen Leon; He, Yuxiong; Tao, Dingwen (June 2023, ACM)
OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

https://doi.org/10.1038/s41592-024-02272-z

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O’Donnell, Timothy J; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; et al (August 2024, Nature Methods)

Full Text Available
MAPA: multi-accelerator pattern allocation policy for multi-tenant GPU servers

https://doi.org/10.1145/3458817.3480853

Ranganath, Kiran; Suetterlein, Joshua D.; Manzano, Joseph B.; Song, Shuaiwen Leon; Wong, Daniel (November 2021, SC'21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis)

Multi-accelerator servers are increasingly being deployed in shared multi-tenant environments (such as in cloud data centers) in order to meet the demands of large-scale compute-intensive workloads. In addition, these accelerators are increasingly being inter-connected in complex topologies and workloads are exhibiting a wider variety of inter-accelerator communication patterns. However, existing allocation policies are ill-suited for these emerging use-cases. Specifically, this work identifies that multi-accelerator workloads are commonly fragmented leading to reduced bandwidth and increased latency for inter-accelerator communication. We propose Multi-Accelerator Pattern Allocation (MAPA), a graph pattern mining approach towards providing generalized allocation support for allocating multi-accelerator workloads on multi-accelerator servers. We demonstrate that MAPA is able to improve the execution time of multi-accelerator workloads and that MAPA is able to provide generalized benefits across various accelerator topologies. Finally, we demonstrate a speedup of 12.4% for 75th percentile of jobs with the worst case execution time reduced by up to 35% against baseline policy using MAPA.
more » « less
Full Text Available
Brief Industry Paper: The Necessity of Adaptive Data Fusion in Infrastructure-Augmented Autonomous Driving System

https://doi.org/10.1109/RTAS54340.2022.00031

Liu, Shaoshan; Wang, Jianda; Wang, Zhendong; Yu, Bo; Hu, Wei; Liu, Yahui; Tang, Jie; Song, Shuaiwen Leon; Liu, Cong; Hu, Yang (May 2022, RTAS 2022)

Full Text Available
Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving

https://doi.org/10.1145/3466752.3480120

Wan, Qiyu; Xia, Haojun; Zhang, Xingyao; Wang, Lening; Song, Shuaiwen Leon; Fu, Xin (October 2021, IEEE/ACM International Symposium on Microarchitecture)

Full Text Available
COMET: a novel memory-efficient deep learning training framework by using error-bounded lossy compression

https://doi.org/10.14778/3503585.3503597

Jin, Sian; Zhang, Chengming; Jiang, Xintong; Feng, Yunhe; Guan, Hui; Li, Guanpeng; Song, Shuaiwen Leon; Tao, Dingwen (December 2021, Proceedings of the VLDB Endowment)

Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Training wide and deep neural networks require large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger models or to accelerate training. Our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis on the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, respectively, with little or no accuracy loss.
more » « less
Full Text Available
Toward efficient interactions between Python and native libraries

https://doi.org/10.1145/3468264.3468541

Tan, Jialiang; Chen, Yu; Liu, Zhenming; Ren, Bin; Song, Shuaiwen Leon; Shen, Xipeng; Liu, Xu (August 2021, The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021))
null (Ed.)
Full Text Available
A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

https://doi.org/10.1145/3437801.3441597

Jin, Sian; Li, Guanpeng; Song, Shuaiwen Leon; Tao, Dingwen (February 2021, The 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2021))

DNNs are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Traditional memory saving techniques such as data recomputation and migration either suffers from a high performance overhead or is constrained by specific interconnect technology and limited bandwidth. In this paper, we propose a novel memory-driven high performance CNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger neural networks. We evaluate our design against state-of-the-art solutions with four widely-adopted CNNs and the ImangeNet dataset. Results demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5x and 1.8x over the baseline training and state-of-the-art framework with compression, respectively, with little or no accuracy loss. The full paper can be referred to at https://arxiv.org/abs/2011.09017.
more » « less
Full Text Available
TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs

https://doi.org/10.1016/j.jpdc.2021.02.013

Rivera, Cody; Chen, Jieyang; Xiong, Nan; Zhang, Jing; Song, Shuaiwen Leon; Tao, Dingwen (February 2021, Journal of Parallel and Distributed Computing)
null (Ed.)
Full Text Available
Q-VR: system-level design for future mobile collaborative virtual reality

https://doi.org/10.1145/3445814.3446715

Xie, Chenhao; Li, Xie; Hu, Yang; Peng, Huwan; Taylor, Michael; Song, Shuaiwen Leon (January 2021, Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)
null (Ed.)
Full Text Available

« Prev Next »

Search for: All records